feat: generate GeoJSON from PostgreSQL instead of re-reading CSV #404
ThibaudDauce wants to merge 13 commits into main
Conversation
Force-pushed from 663d2a3 to 3999e41
Stream GeoJSON features directly from the database using a cursor, avoiding a third read of the CSV file. PostgreSQL builds the JSON with json_build_object, so no Python-level casting is needed. Also extract _detect_geo_columns to deduplicate the geo column detection logic, and add detailed timer marks for the geojson and pmtiles conversion steps.
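For context, a minimal sketch of the pattern described above, assuming asyncpg and a hypothetical table `"my_table"` with `"lat"`/`"lon"`/`"city"` columns; in the PR the real query is built per-resource by `_build_feature_sql`:

```python
# Minimal sketch, assuming asyncpg and a hypothetical table layout;
# the PR builds the real query per-resource in _build_feature_sql.
from pathlib import Path

import asyncpg


async def stream_geojson(conn: asyncpg.Connection, output_path: Path) -> None:
    # PostgreSQL assembles each feature with json_build_object, so rows
    # arrive as ready-to-write JSON strings: no Python-level casting.
    query = """
        SELECT json_build_object(
            'type', 'Feature',
            'geometry', json_build_object(
                'type', 'Point',
                'coordinates', json_build_array("lon", "lat")
            ),
            'properties', json_build_object($1::text, "city")
        )::text AS feature_json
        FROM "my_table"
        WHERE "lat" IS NOT NULL AND "lon" IS NOT NULL
    """
    with output_path.open("w") as f:
        f.write('{"type": "FeatureCollection", "features": [\n')
        first = True
        async with conn.transaction():  # asyncpg cursors require a transaction
            async for row in conn.cursor(query, "city"):
                if not first:
                    f.write(",\n")
                f.write(row["feature_json"])
                first = False
        f.write("\n]}\n")
```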
Force-pushed from 3999e41 to 3181c12
Still a bug when there are a lot of columns, will push a fix next week!

I might wait for that PR to be finished to be able to base this new PR on it, to separate the changes.
# Conflicts:
#	tests/test_analysis/test_parquet_export.py
Pierlou left a comment:
Thanks for the refactor 🙏 The flow looks good to me (though I haven't gone through the technicalities of the SQL part), but I have a doubt regarding the type of the properties we export, and a couple of remarks.
```python
return query, params


async def csv_to_geojson_from_db(
```
We may want to keep the function name close to its parquet counterpart
Suggested change:
```diff
-async def csv_to_geojson_from_db(
+async def save_as_geojson_from_db(
```
... wouldn't db_to_geojson and db_to_parquet be even more explicit, and more consistent with names like csv_to_geojson?
```
# ending up here means we either have the exact lat,lon format, or NaN
# skipping row if NaN
if row[geo["latlon"]] is None:
elif "latlon" in geo or "lonlat" in geo:
```
Keeping the insight
| elif "latlon" in geo or "lonlat" in geo: | |
| elif "latlon" in geo or "lonlat" in geo: | |
| # skipping row if geo data is None |
```python
# latlon/lonlat columns can contain values like "[48.8566, 2.3522]" or "48.8566 , 2.3522"
# Both versions below strip spaces and brackets, then split on comma.
```
The insight is already given within the function, this feels misplaced here
Suggested change:
```diff
-# latlon/lonlat columns can contain values like "[48.8566, 2.3522]" or "48.8566 , 2.3522"
-# Both versions below strip spaces and brackets, then split on comma.
```
```python
for col in property_cols:
    params.append(col)
    placeholder = f"${len(params)}::text"
    properties_fragments.append(f"{placeholder}, {_quote_ident(_db_col_name(col))}")
```
NIT
Suggested change:
```diff
-for col in property_cols:
-    params.append(col)
-    placeholder = f"${len(params)}::text"
-    properties_fragments.append(f"{placeholder}, {_quote_ident(_db_col_name(col))}")
+for idx, col in enumerate(property_cols):
+    params.append(col)
+    properties_fragments.append(f"${idx + 1}::text, {_quote_ident(_db_col_name(col))}")
```
```python
properties_fragments = []
for col in property_cols:
    params.append(col)
    placeholder = f"${len(params)}::text"
```
Does that mean we're exporting all values as text? In which case it is not what we want (columns that contain ints/floats should be exported as such)
No, here we're talking about the column names (the JSON keys); I've added a comment in df12267
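To illustrate (a sketch with a hypothetical float column `"price"`): the `::text` cast applies only to the bound key parameter, while the value slot references the column itself, so PostgreSQL keeps the column's native type in the output.

```python
# Hypothetical illustration: the bound parameter is the JSON *key*,
# cast to text; the value is the column reference, untouched.
query = 'SELECT json_build_object($1::text, "price")::text FROM "my_table"'
# Executed with the parameter "price", each row comes back as e.g.:
#   {"price" : 12.5}   <- a JSON number, not the string "12.5"
```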
```python
first = True
async for row in cursor:
    if not first:
        f.write(",\n")
    f.write(row[0])
    first = False
```
I don't think I understand the first trick: why don't we f.write(f"{row[0]},\n") for all rows? (in which case we should finally f.write("]"))
It will have a trailing comma in the JSON?
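A small self-contained sketch of why the flag is needed: JSON forbids trailing commas, so the separator has to go between features, which the `first` flag achieves by prefixing a comma to every row but the first.

```python
rows = ['{"a": 1}', '{"b": 2}']

# Writing f"{row},\n" for every row would end with:
#   {"a": 1},
#   {"b": 2},
#   ]          <- invalid JSON: trailing comma before the bracket
#
# Prefixing the comma to every row except the first stays valid:
parts = []
first = True
for row in rows:
    if not first:
        parts.append(",\n")
    parts.append(row)
    first = False
print("".join(parts))  # {"a": 1},\n{"b": 2} -- no trailing comma
```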
bolinocroustibat left a comment:
Thank you. A few remarks, otherwise LGTM.
```python
with output_file_path.open("w") as f:
    f.write('{"type": "FeatureCollection", "features": [\n')
    first = True
    async for row in cursor:
```
What happens in case of an error while parsing the rows?
Should we delete output_file_path in a try/finally if we detect failure after partial write, or wrap the stream loop in try/finally to remove the output file on any error after open?
The caller csv_to_geojson_and_pmtiles in csv.py:193-194 already wraps the call in a try/except that calls remove_remainders(resource_id, ["geojson", "pmtiles", "pmtiles-journal"]). So if an error occurs during streaming, the partial file is cleaned up by the caller.
This is the same pattern as the CSV path — csv_to_geojson doesn't handle cleanup on error either, it's delegated to the caller.
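A sketch of that caller-side pattern, with the names cited above stubbed out so the control flow is self-contained:

```python
# Stubs standing in for the real implementations in the codebase.
async def csv_to_geojson_from_db(resource_id: str) -> None: ...
def remove_remainders(resource_id: str, kinds: list[str]) -> None: ...


async def csv_to_geojson_and_pmtiles(resource_id: str) -> None:
    try:
        await csv_to_geojson_from_db(resource_id)  # may fail mid-stream
    except Exception:
        # a partial .geojson may exist on disk: clean up, then re-raise
        remove_remainders(resource_id, ["geojson", "pmtiles", "pmtiles-journal"])
        raise
```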
```python
    upload_to_minio: bool = True,
) -> tuple[int, str | None] | None:
    """Generate a GeoJSON file by streaming features directly from PostgreSQL."""
    geo = _detect_geo_columns(inspection)
```
NIT: a type hint would be nice here for readability:
```python
geo: dict | None = _detect_geo_columns(inspection)
```
_detect_geo_columns already has a return type? Your IDE should show you the type hint on hover, no?
My IDE doesn't :) I add type hints when they're not obvious at first sight.
```python
    output_file_path: Path,
    upload_to_minio: bool = True,
) -> tuple[int, str | None] | None:
    """Generate a GeoJSON file by streaming features directly from PostgreSQL."""
```
NIT: csv_to_geojson_from_db has a one-line docstring while csv_to_geojson documents args, behavior (skipped rows), and return values. Aligning the DB variant (even briefly: inputs, streaming, parity goal with the CSV path) might make it easier to maintain and choose between the two in case we have to.
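For instance, a hedged sketch of what an aligned docstring could look like (signature pieced together from the diff hunks; the return-tuple description is an assumption):

```python
from pathlib import Path


async def csv_to_geojson_from_db(
    inspection: dict,
    table_name: str,
    output_file_path: Path,
    upload_to_minio: bool = True,
) -> tuple[int, str | None] | None:
    """Generate a GeoJSON file by streaming features directly from PostgreSQL.

    The feature JSON is built in the database (json_build_object) and
    streamed through a cursor, so the CSV file is never re-read. Rows whose
    geo columns are NULL are skipped, mirroring csv_to_geojson.

    Returns:
        The same tuple as csv_to_geojson (assumed: file size and MinIO URL),
        or None when no geo columns are detected in the inspection.
    """
```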
```python
async for row in cursor:
    if not first:
        f.write(",\n")
    f.write(row[0])
```
NIT: Might be a good opportunity to add a column alias on the outer SELECT in _build_feature_sql, to avoid using row[0], preferring something like row["feature_json"]; that would massively improve readability imho.
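Something like this (a sketch; the actual outer SELECT lives in _build_feature_sql):

```python
# Sketch: alias the built JSON on the outer SELECT...
query = """
    SELECT json_build_object('type', 'Feature', ...)::text AS feature_json
    FROM source_table
"""
# ...so the write loop can use key access (asyncpg Records support it):
#     f.write(row["feature_json"])   # instead of the positional row[0]
```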
| template["features"] = streamable_list(get_features(file_path, inspection, geo)) | ||
|
|
||
| with output_file_path.open("w") as f: | ||
| json.dump(template, f, indent=4, ensure_ascii=False, default=str) |
GeoJSON files are often large; pretty-printing can add a meaningful amount of redundant bytes on disk and for uploads, even if compression narrows the gap.
Also, indent=4 produces a pretty-printed JSON while the database path streams a compact document, with no re-indentation. Someone diffing two exports or grepping a file might think something broke when they see the compact stream.
Should we treat compact as the normal export format for both paths, dropping indent=4 on the CSV path, and assume that if a human needs to read the file, they will format it themselves anyway?
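A sketch of the compact variant being suggested (file name hypothetical):

```python
import json

template = {"type": "FeatureCollection", "features": []}
with open("out.geojson", "w") as f:
    # drop indent=4 and tighten separators for a byte-dense export
    json.dump(template, f, ensure_ascii=False, separators=(",", ":"), default=str)
```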
Yes, but this path is only for csv_to_geojson (the line only moved). I think we can leave it since it will be removed soon?
```diff
 # Convert to GeoJSON — from DB if available and enabled, otherwise from CSV file
 if config.DB_TO_GEOJSON and table_name:
-    result = await csv_to_geojson_from_db(
+    result = await save_as_geojson_from_db(
```
I think we should use this opportunity to rename:
- save_as_geojson_from_db → db_to_geojson
- save_as_parquet_from_db → db_to_parquet

Those names match the source_to_target style we already use (csv_to_parquet, csv_to_db, geojson_to_pmtiles, parquet_to_db), and they line up nicely with our config flags.
Optionally, to match, we could also rename save_as_parquet to rows_to_parquet.
Stream GeoJSON features directly from the database using a cursor, avoiding a third read of the CSV file. PostgreSQL builds the JSON with json_build_object, so no Python-level casting is needed.
Also extract _detect_geo_columns to deduplicate the geo column detection logic.
This will allow sending this part to another job.
On MN_07_latest-2025-2026.csv: 50s → 18s (-64%)
On joconde.csv: 75s → 43s (-42%)

Same as #402 but for GeoJSON.